EXECUTIVE SUMMARY


OBJECTIVE 

This report describes experiments designed to evaluate the usefulness of a specific algorithm for
classifying images of commercial ships by class. This algorithm uses a technique known as sparse
coding to represent images for classification. The sparse coding algorithm is compared with another
algorithm evaluated in previous publications.


RESULTS 

The sparse coding algorithm is shown to perform approximately as well as the algorithm it is
compared with and does not appear to offer any improvement.


RECOMMENDATIONS 


Additional  research  is  required  to  identify  algorithms  best  suited  for  the  ship  classification  task. 
1. INTRODUCTION


Automated vessel detection and recognition is an important goal for many Navy applications. The
automated classification of merchant ships would assist imagery analysts and provide greater maritime
domain awareness. This is a challenging problem, in part due to the nature of ship imagery. Ship
recognition algorithms must be able to handle variations in resolution and illumination conditions and
broadly defined ship categories.

Previous work by researchers at Space and Naval Warfare Systems Center Pacific (SSC Pacific) [5,
12-15] investigated several recognition algorithms and tested them on several data sets of ship images
from satellite imagery. Those works considered various methods for image representation and
classification, two general steps in the image recognition process. A digital image can be represented
as a vector of pixel values, or through a more involved algorithm that attempts to capture semantic
value from the pixels. Once an image is represented numerically, that representation can be passed to a
classifier which will apply a semantic label to the image. Several algorithms for representation and
classification are detailed in [13].

Some of the highest accuracy results in that work were obtained by constructing image representations
with the bag of (visual) words (BOW) method [19]. The BOW algorithm first extracts local feature
descriptors from an image using the Scale-Invariant Feature Transform (SIFT) [8]. The descriptors are
then clustered and pooled with respect to a dictionary of vocabulary features obtained from training
imagery. The image is represented as a histogram of its pooled features. One advantage this method has
over other representation methods is that the dimensions of its output representation are independent
of the dimensions of its input image, which means that two images with different sizes will have the
same size representations. This makes the BOW method useful for data sets of images with non-uniform
dimensions, such as the ship data described in [14]. Several variations on the BOW method were compared
in [13].

Several classification methods were investigated in [13], including Support Vector Machines (SVM),
which have been used to much success for many different recognition tasks [16]. An SVM is a type of
linear classifier that is designed to maximize the margin between the decision boundary and the nearest
positive and negative examples, which are called support vectors. The highest accuracy rates in [13]
were achieved with BOW image representations and SVM classifiers.

Another classification method considered in [13], sparse representation-based classification (SRC),
classifies an image representation by first expressing it as a sparse linear combination of the columns
of a dictionary matrix learned from training images. This method was used by Wright, Yang, Ganesh,
Sastry, and Ma [18] to classify images of faces represented by randomly selected pixel values.

Sparse codes have also been used for image representation, rather than for classification. An algorithm
referred to as ScSPM¹ has been used for image representation by some authors as a replacement for BOW
[6, 21]. In the BOW method, SIFT descriptors are extracted and then quantized with respect to a
dictionary. The quantization step means that each descriptor is associated with the one dictionary
element to which it is most similar. The ScSPM method replaces quantization with sparse coding; each
descriptor is associated with the coefficients of a linear combination of dictionary elements that
approximates the descriptor. More information about a descriptor can be retained through sparse coding,
since it can be associated with more than one dictionary element.

Any number of SIFT descriptors may be extracted from a given image, so it is necessary to somehow
pool the quantized descriptors into a single representation of the image. The BOW method pools
descriptors into a histogram, but this discards any spatial information from the descriptors. The ScSPM
method pools descriptors using a Spatial Pyramid Matching (SPM) algorithm [7] that concatenates
histograms from different regions of the image at multiple scales.

¹ Similar techniques were used by Yang, Yu, Gong, and Huang [21], who called their method ScSPM to
refer to sparse coding and Spatial Pyramid Matching, and by Ji, Theiler, Chartrand, Kenyon, and Brumby
[6], who call their method SIFT-based sparse coding.

In this report, we compare the effectiveness of the ScSPM algorithm for classifying ship imagery with
that of BOW. Both algorithms are used to represent ship images, and the representations are fed into an
SVM for classification. Our results show similar classification accuracy using both methods, and suggest
that some of the perceived advantages of ScSPM may not make a difference with certain data sets. This
report is organized as follows: the algorithms considered are detailed in Section 2, the experiments are
described in Section 3, and the report is concluded in Section 4.


2.  DESCRIPTION  OF  ALGORITHMS 


We  perform  image  classification  using  two  different  methods  for  image  representation — ScSPM  and 
BOW — each  combined  with  an  SVM  classifier.  This  section  describes  the  details  of  the  implementations 
of  the  two  image  representation  methods. 

Both image representation methods are based on SIFT feature descriptors. Typically, the SIFT
algorithm identifies salient keypoints in an image, then computes 128-dimensional descriptors of the
region surrounding the keypoints using a histogram of local gradients. These descriptors can be used to
match an object in different images, even under changes of scale or illumination. For this work, we use
two variations on this algorithm. Dense SIFT computes SIFT descriptors at a dense grid of points, rather
than at keypoints. This implementation is faster than traditional SIFT. Dense SIFT extracts features at
a single scale, but Bosch, Zisserman, and Muñoz [1] proposed a method called pyramid histogram of visual
words (PHOW) which extracts dense SIFT features at multiple scales. Both dense SIFT and PHOW are
implemented using the VLFeat library [17]. For either method, given an image $X$, we compute a set of
$p$ descriptors, $\psi = \{y_1, y_2, \dots, y_p\}$, with $y_i \in \mathbb{R}^{128}$ for each
$i = 1, \dots, p$. The value of $p$ varies with each image.

Both representation methods also rely on a dictionary formed from a set of $M$ training images,

$$\mathcal{X}_T = \{X_1, \dots, X_M\}.$$

Each training image $X_j$ provides $p_j$ descriptors, so we obtain a set

$$\psi_T = \{y_1^{(1)}, \dots, y_{p_1}^{(1)}, \dots, y_1^{(M)}, \dots, y_{p_M}^{(M)}\}$$

containing all of the descriptors for all of the training images. Methods for constructing a dictionary
from these descriptors are described in the sections below.


2.1  SPARSE  CODED  FEATURES  (SCSPM) 

The aim of sparse coding is to express an input signal $y \in \mathbb{R}^{n \times 1}$ as a linear
combination of the columns of a dictionary matrix $D \in \mathbb{R}^{n \times K}$. The coefficients of
this linear combination are stored in the vector $\alpha \in \mathbb{R}^{K \times 1}$, which can be used
to represent the input $y$. That is, we want to solve the expression

$$y = D\alpha \tag{1}$$

for $\alpha$. $D$ should be overcomplete ($K \gg n$) to ensure that there are more than enough columns
with which to express any given input $y$. If $D$ is overcomplete, then there exist infinitely many
solutions to Equation (1). We want $\alpha$ to be sparse, with most of its entries zero, which means
that only a few of the columns of the dictionary contribute significantly to the representation of $y$.
Therefore we want to find the sparsest possible solution to Equation (1). This requirement can be
expressed as

$$\operatorname*{argmin}_{\alpha \in \mathbb{R}^K} \Omega(\alpha) \quad \text{such that} \quad y = D\alpha, \tag{2}$$

or

$$\operatorname*{argmin}_{\alpha \in \mathbb{R}^K} \|y - D\alpha\|_2^2 + \lambda\,\Omega(\alpha), \tag{3}$$

where $\Omega(\cdot)$ is a sparsity-enforcing function and $\lambda$ is a weighting parameter. The
immediately obvious choice for $\Omega$ is the so-called “norm” $\|\cdot\|_0$ that counts the number of
non-zero entries of a vector, but that makes Problem (3) non-convex and therefore computationally
challenging to solve. Another option for $\Omega$ is the $\ell_1$ norm $\|\cdot\|_1$, a true norm. This
option has been shown to yield sparse solutions [3].
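As an illustration of the $\ell_1$-penalized form of Problem (3), the following is a minimal sketch that minimizes $\|y - D\alpha\|_2^2 + \lambda\|\alpha\|_1$ with the iterative shrinkage-thresholding algorithm (ISTA). ISTA is a stand-in chosen here for brevity; it is not the solver used in this report, which relies on the SPAMS package. The function names, the toy sizes, and the value of `lam` are assumptions made for the example.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (element-wise shrinkage toward zero)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def sparse_code_ista(y, D, lam, n_iter=500):
    """Approximately minimize ||y - D a||_2^2 + lam * ||a||_1 over a (hypothetical helper)."""
    # The gradient of the quadratic term is 2 D^T (D a - y);
    # its Lipschitz constant is 2 * sigma_max(D)^2, which fixes the step size 1 / L.
    L = 2.0 * np.linalg.norm(D, 2) ** 2
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * D.T @ (D @ a - y)
        a = soft_threshold(a - grad / L, lam / L)
    return a

# Toy problem (assumed sizes; the report uses n = 128 and K = 1024 or 2048).
rng = np.random.default_rng(0)
n, K = 16, 64
D = rng.standard_normal((n, K))
D /= np.linalg.norm(D, axis=0)          # unit-norm dictionary columns
a_true = np.zeros(K)
a_true[[3, 17, 40]] = [1.5, -2.0, 1.0]  # a signal built from three columns
y = D @ a_true

a_hat = sparse_code_ista(y, D, lam=0.1)
print("non-zero coefficients:", np.count_nonzero(a_hat))
print("residual norm:", np.linalg.norm(y - D @ a_hat))
```

Larger values of `lam` push more coefficients to exactly zero at the cost of a larger residual, which is the trade-off that $\lambda$ controls in Problem (3).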

2.1.1  Dictionary  Learning 


Many options exist for constructing a dictionary from a set of training images $\mathcal{X}_T$. To
begin with, we use the dense SIFT algorithm to compute the set $\psi_T$ containing all of the SIFT
descriptors from all of the training images, then randomly select $K$ descriptors from this set. The $K$
descriptors can simply be concatenated into a matrix that can serve as a dictionary, but that may not
lead to especially sparse solutions. A pre-defined dictionary of basis vectors, such as one constructed
with wavelets, may be used instead. Much research has been conducted into how best to learn a dictionary
that is specially adapted to training data and that lends itself to sparse representation, and how the
dictionary can be computed efficiently [10]. Some dictionary learning algorithms have been designed for
a specific task such as classification [11] or image denoising [4]. Several dictionary learning methods
are compared in [6] for their use for sparse coding of image features. In the experiments described in
this report, we use the SPArse Modeling Software (SPAMS) package [9, 10] to solve the
$\ell_1$-regularized problem

$$\operatorname*{argmin}_{D \in \mathbb{R}^{n \times K}} \frac{1}{N_T} \sum_{y \in \psi_T} \min_{\alpha \in \mathbb{R}^K} \left( \frac{1}{2} \|y - D\alpha\|_2^2 + \lambda \|\alpha\|_1 \right), \tag{4}$$

where $N_T$ is the number of signals in the training set $\psi_T$, and $K$ is a fixed value indicating
the desired size of the dictionary.
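To make the structure of Problem (4) concrete, here is a small batch sketch that alternates between sparse coding all training signals with the dictionary fixed and updating the dictionary by regularized least squares with the codes fixed. This is a simplified stand-in, not the online algorithm implemented in SPAMS [10]; the function names, the `eps` regularizer, the iteration counts, and the toy data are all assumptions.

```python
import numpy as np

def sparse_codes(Y, D, lam, n_iter=100):
    """ISTA applied to every column of Y for 0.5 * ||y - D a||_2^2 + lam * ||a||_1."""
    L = np.linalg.norm(D, 2) ** 2            # Lipschitz constant of the smooth term
    A = np.zeros((D.shape[1], Y.shape[1]))
    for _ in range(n_iter):
        Z = A - D.T @ (D @ A - Y) / L
        A = np.sign(Z) * np.maximum(np.abs(Z) - lam / L, 0.0)
    return A

def learn_dictionary(Y, K, lam, n_outer=10, eps=1e-6, seed=0):
    """Alternating minimization sketch for Problem (4); Y holds one signal per column."""
    rng = np.random.default_rng(seed)
    # Initialize with K randomly selected training signals, then normalize the columns.
    D = Y[:, rng.choice(Y.shape[1], size=K, replace=False)].astype(float)
    D /= np.linalg.norm(D, axis=0) + 1e-12
    for _ in range(n_outer):
        A = sparse_codes(Y, D, lam)           # inner minimization over the codes
        G = A @ A.T + eps * np.eye(K)         # regularized Gram matrix of the codes
        D_new = np.linalg.solve(G, A @ Y.T).T # least-squares update D = Y A^T G^{-1}
        norms = np.linalg.norm(D_new, axis=0)
        used = norms > 1e-10                  # atoms that no code uses keep their old value
        D[:, used] = D_new[:, used] / norms[used]
    return D

# Toy data standing in for SIFT descriptors (n = 16 and N_T = 200 are assumed sizes).
rng = np.random.default_rng(1)
Y = rng.standard_normal((16, 200))
D = learn_dictionary(Y, K=32, lam=0.2)
print(D.shape)
```

Normalizing the columns after each update keeps the dictionary from growing without bound, which would otherwise let the codes shrink and trivially reduce the $\ell_1$ penalty.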

We next use our learned dictionary $D$ to compute the ScSPM representation of an image $X$. For each
$y \in \psi$, where $\psi$ is the set of dense SIFT feature descriptors extracted from $X$, we find the
sparse code $\alpha$ with respect to $D$ by solving Problem (3). We solve the $\ell_1$ regularization of
(3), with $\Omega(\cdot) = \|\cdot\|_1$, using the SPAMS package. This leaves us with a set
$A_\psi = \{\alpha_1, \dots, \alpha_p\}$ containing the $p$ sparse codes corresponding to the $p$ dense
SIFT descriptors from the image $X$.


2.1.2  Spatial  Pyramid  Matching 


Our next step is to pool the $p$ sparse coded descriptors into a single representation of $X$, for
which we use the Spatial Pyramid Matching (SPM) algorithm proposed by Lazebnik, Schmid, and Ponce [7].
We divide $X$ into three different partitions, 4 x 4, 2 x 2, and 1 x 1, for a total of 21 segments. Each
segment is home to a subset of the descriptors of $X$ collected in $\psi$; we denote the subset of
$\psi$ associated with the $l$th segment by $\psi_l$, and the corresponding set of sparse codes by
$A_{\psi_l}$. We then pool all of the sparse codes from a given segment by selecting the maximum value
component-wise, giving us the vector $z_l = [z_j^{(l)}] \in \mathbb{R}^K$, where for each
$j = 1, \dots, K$,

$$z_j^{(l)} = \max_{\alpha \in A_{\psi_l}} \alpha_j, \tag{5}$$

where $\alpha_j$ is the $j$th component of $\alpha$.

At this point we have 21 vectors $z_1, \dots, z_{21} \in \mathbb{R}^K$. By pooling vectors from the
segments of several partitions, we are capturing information from different spatial regions and on
different scales. Our last step is to concatenate these 21 vectors into one vector
$z \in \mathbb{R}^{21K}$. The vector $z$ is the ScSPM representation of the original image $X$.
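The pooling and concatenation steps can be sketched as follows, assuming the sparse codes are stored one per row together with the pixel location of each descriptor. The function name `spm_max_pool`, the handling of empty segments (a zero vector), and the random toy inputs are assumptions for illustration only.

```python
import numpy as np

def spm_max_pool(codes, xy, width, height, levels=(4, 2, 1)):
    """Max-pool sparse codes over a spatial pyramid.

    codes: (p, K) array with one sparse code per row.
    xy:    (p, 2) array of descriptor centers (x, y) in pixels.
    Returns a vector of length (sum of g * g over levels) * K, i.e. 21 * K by default.
    """
    K = codes.shape[1]
    pooled = []
    for g in levels:
        # Grid cell containing each descriptor for a g x g partition of the image.
        col = np.minimum((xy[:, 0] * g / width).astype(int), g - 1)
        row = np.minimum((xy[:, 1] * g / height).astype(int), g - 1)
        for r in range(g):
            for c in range(g):
                mask = (row == r) & (col == c)
                # Component-wise maximum as in Equation (5); an empty segment
                # contributes a zero vector (an assumption of this sketch).
                pooled.append(codes[mask].max(axis=0) if mask.any() else np.zeros(K))
    return np.concatenate(pooled)

# Toy inputs: 500 codes of length K = 8 scattered over a 300 x 150 image.
rng = np.random.default_rng(0)
p, K = 500, 8
codes = rng.random((p, K))
xy = rng.random((p, 2)) * [300, 150]
z = spm_max_pool(codes, xy, width=300, height=150)
print(z.shape)  # (168,)
```

With the default level order, the last $K$ entries of the output correspond to the 1 x 1 partition, so they equal the component-wise maximum over all codes in the image.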


2.2  BAG  OF  WORDS  (BOW) 

Bag of (visual) words (BOW) is a feature extraction approach inspired by the bag of words
representation used in text classification tasks [20]. In text applications, BOW treats a document as a
collection of words independent of each other, ignoring the order and context in which the words are
used. In image applications, BOW represents an image as a histogram of its local features, using feature
descriptors such as SIFT. For this work we used the PHOW algorithm to extract dense SIFT feature
descriptors at multiple scales.

Whereas the ScSPM algorithm represents images with respect to a dictionary, BOW represents images
with respect to a vocabulary constructed from the set of training images $\mathcal{X}_T$. Using the
VLFeat software library [17], we compute PHOW feature descriptors for each image in $\mathcal{X}_T$ and
collect them in the set $\psi_T$. We then fix the parameter $K$ and cluster the descriptors into $K$
clusters using the $K$-means algorithm. This gives us a matrix
$V = [v_i]_{i=1}^{K} \in \mathbb{R}^{n \times K}$ (where $n$ is the length of each descriptor, in our
case 128) whose columns $v_i$ are the cluster centers, or “words.”
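The vocabulary construction can be illustrated with a plain implementation of Lloyd's algorithm for $K$-means. This is only a sketch of the clustering idea; the report uses the VLFeat implementation, and the function name `kmeans_vocabulary`, the initialization from random data points, the fixed iteration count, and the toy data are assumptions.

```python
import numpy as np

def kmeans_vocabulary(Y, K, n_iter=20, seed=0):
    """Cluster the columns of Y (n x N) into K groups; return the centers as V (n x K)."""
    rng = np.random.default_rng(seed)
    # Start from K distinct training descriptors chosen at random.
    V = Y[:, rng.choice(Y.shape[1], size=K, replace=False)].astype(float)
    for _ in range(n_iter):
        # Squared Euclidean distance between every center and every descriptor, shape (K, N).
        d2 = (V ** 2).sum(axis=0)[:, None] - 2.0 * V.T @ Y + (Y ** 2).sum(axis=0)[None, :]
        labels = d2.argmin(axis=0)
        for k in range(K):
            members = Y[:, labels == k]
            if members.shape[1] > 0:          # an empty cluster keeps its previous center
                V[:, k] = members.mean(axis=1)
    return V

# Toy data: two well-separated groups of 2-D points standing in for descriptors.
rng = np.random.default_rng(0)
Y = np.hstack([rng.normal(0.0, 1.0, (2, 50)), rng.normal(100.0, 1.0, (2, 50))])
V = kmeans_vocabulary(Y, K=2)
print(np.round(V, 1))
```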

Given an image $X$, we compute the set $\psi$ containing its PHOW descriptors. For each $y \in \psi$,
we find the “word” that it is closest to, a step referred to as quantization. We define a function
$Q(\cdot)$ by which to associate $y$ with the index of its closest “word”, that is,

$$Q(y) = \operatorname*{argmin}_{i = 1, \dots, K} \|v_i - y\|_2. \tag{6}$$

After quantizing each descriptor, we compile a histogram $z = [z_i]_{i=1}^{K}$ reflecting how many of
each “word” are represented in $X$, so for each $i = 1, \dots, K$,

$$z_i = |\{y \in \psi : Q(y) = i\}|. \tag{7}$$

This histogram $z \in \mathbb{R}^K$ is the BOW representation of the image $X$.
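Equations (6) and (7) can be sketched in a few lines, assuming the descriptors and the vocabulary are stored as columns of two arrays. The function name `bow_histogram` and the tiny two-dimensional example are assumptions; note that the code uses 0-based word indices, whereas the text counts from 1.

```python
import numpy as np

def bow_histogram(Y, V):
    """Quantize the columns of Y (n x p) against the vocabulary V (n x K) and count words."""
    # Squared Euclidean distance between every word and every descriptor, shape (K, p).
    d2 = (V ** 2).sum(axis=0)[:, None] - 2.0 * V.T @ Y + (Y ** 2).sum(axis=0)[None, :]
    q = d2.argmin(axis=0)                       # Equation (6): index of the closest word
    return np.bincount(q, minlength=V.shape[1]) # Equation (7): occurrences of each word

# Three words at (0, 0), (10, 0), and (0, 10), and four descriptors.
V = np.array([[0.0, 10.0, 0.0],
              [0.0, 0.0, 10.0]])
Y = np.array([[1.0, 9.0, 0.5, 1.0],
              [1.0, 1.0, 0.0, 9.0]])
z = bow_histogram(Y, V)
print(z)  # [2 1 1]
```

The histogram length depends only on $K$, not on the number of descriptors, which is why images of different sizes yield representations of the same size.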


3.  EXPERIMENTS 

We compared the effectiveness of the ScSPM and BOW algorithms for image representations by using
them to classify a four-class set of ship images chipped from satellite imagery. The data set contains
200 images per class, and each image is repeated under various degrees of pre-processing to give four
separate data sets: the original images have non-uniform dimensions and have ships pointed in every
direction, the rotated images have all the ships pointing up, the cropped images have some excess
background removed, and the resized images are all 300 x 150 pixels and have bow, stern, port, and
starboard points on each ship aligned. This data set is described in more detail in [14].

We tested each representation algorithm with two different values of $K$ (1000 and 2000 for BOW and
1024 and 2048 for ScSPM), which represents the dictionary size for ScSPM and the vocabulary size for
BOW. For each pre-processing type, we divided the images into 80%/20% splits, using 160 images per
class for training and the remainder for testing. We classified the testing images with an SVM,
implemented with LibSVM [2], using a linear kernel. We classified each data set five times, each time
with a different split used for testing. Table 1 shows the average classification accuracy over the five
runs, for each representation algorithm, dictionary or vocabulary size, and data type.
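One way to realize the evaluation protocol is sketched below, under the assumption that the five 80%/20% splits form a class-balanced five-fold partition in which every image is tested exactly once. The text does not state how the splits were drawn, so this interpretation, the function name `stratified_folds`, and the random seed are assumptions.

```python
import numpy as np

def stratified_folds(labels, n_folds=5, seed=0):
    """Partition image indices into n_folds test sets with equal counts per class."""
    rng = np.random.default_rng(seed)
    folds = [[] for _ in range(n_folds)]
    for c in np.unique(labels):
        # Shuffle the indices of class c and deal them out evenly across the folds.
        idx = rng.permutation(np.flatnonzero(labels == c))
        for f, part in enumerate(np.array_split(idx, n_folds)):
            folds[f].extend(part.tolist())
    return [np.array(sorted(f)) for f in folds]

labels = np.repeat(np.arange(4), 200)   # four classes with 200 images each
folds = stratified_folds(labels)
for run, test_idx in enumerate(folds):
    train_idx = np.setdiff1d(np.arange(labels.size), test_idx)
    # A classifier would be trained on train_idx and scored on test_idx here;
    # the reported figure is the mean accuracy over the five runs.
    print(run, train_idx.size, test_idx.size)
```

Each run uses 640 training images (160 per class) and 160 testing images (40 per class), which matches the 80%/20% proportions given above.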

One distinction between the ScSPM and BOW algorithms is that ScSPM preserves some spatial
information, but BOW does not. It is not surprising that on the original data set there is a four
percentage point drop in average accuracy from BOW to ScSPM when K = 1000 or 1024, and a six percentage
point drop when K = 2000 or 2048. This is because the original data is not spatially uniform, so
different regions of the ships are not in the same locations from image to image. For the other data
sets, all of which have some spatial uniformity, the results between the two algorithms are similar,
with no clear advantage to either.


Table 1. Average classification accuracies (%) on various data sets using either BOW or ScSPM image
representations.

                  BOW                     ScSPM
                  K = 1000    K = 2000    K = 1024    K = 2048
Original Data     76.0        79.4        72.0        73.3
Rotated           91.5        91.5        90.3        90.0
Cropped           92.3        92.8        93.9        94.1
Resized           94.0        94.8        95.0        94.3
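The gaps between the two methods can be checked directly from the values in Table 1. The short script below computes the BOW minus ScSPM difference in percentage points for each data set, pairing the smaller and the larger value of K; the dictionary layout and variable names are chosen only for this example.

```python
# Accuracies from Table 1 as (smaller K, larger K) pairs.
bow = {"Original Data": (76.0, 79.4), "Rotated": (91.5, 91.5),
       "Cropped": (92.3, 92.8), "Resized": (94.0, 94.8)}
scspm = {"Original Data": (72.0, 73.3), "Rotated": (90.3, 90.0),
         "Cropped": (93.9, 94.1), "Resized": (95.0, 94.3)}

# Positive values mean BOW is more accurate; negative values favor ScSPM.
diffs = {name: [round(b - s, 1) for b, s in zip(bow[name], scspm[name])]
         for name in bow}
for name, d in diffs.items():
    print(f"{name:14s} {d}")
# Original Data  [4.0, 6.1]
# Rotated        [1.2, 1.5]
# Cropped        [-1.6, -1.3]
# Resized        [-1.0, 0.5]
```

Only the original data set shows a gap of several points; on the pre-processed sets the differences stay within about 1.6 points and change sign between data sets.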

4.  CONCLUSION 

This report contains results from experiments testing the hypothesis that the ScSPM algorithm, which
produces sparse coded image representations with learned dictionaries, will provide improvement to
classification accuracy over the BOW methods tested on ship imagery in [13, 14]. This hypothesis does
not hold, and our results suggest that ScSPM is not as effective as BOW on data with no spatial
alignment. On aligned data the two algorithms perform similarly; however, the performance of the ScSPM
algorithm may be improved by different parameter selections. Many variations of BOW were detailed and
tested in [13], and the variant described in this report produced the best results of all those tested.
There are several possible ways to improve ScSPM, none of which have been tried on this ship data.
Future experimentation will test other ways to solve the dictionary learning Problem (4) as well as the
sparse coding Problem (3).